Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization
We study the conditions under which one is able to efficiently apply
variance-reduction and acceleration schemes on finite sum optimization
problems. First, we show that, perhaps surprisingly, the finite sum structure
by itself is not sufficient for obtaining a complexity bound of
$\tilde{\mathcal{O}}((n+L/\mu)\ln(1/\epsilon))$ for $L$-smooth and $\mu$-strongly
convex individual functions - one must also know which individual function is
being referred to by the oracle at each iteration. Next, we show that for a
broad class of first-order and coordinate-descent finite sum algorithms
(including, e.g., SDCA, SVRG, SAG), it is not possible to get an 'accelerated'
complexity bound of $\tilde{\mathcal{O}}((n+\sqrt{nL/\mu})\ln(1/\epsilon))$, unless
the strong convexity parameter is given explicitly. Lastly, we show that when
this class of algorithms is used for minimizing $L$-smooth and convex finite
sums, the optimal complexity bound is $\tilde{\mathcal{O}}(n+L/\epsilon)$, assuming
that (on average) the same update rule is used in every iteration, and
$\tilde{\mathcal{O}}(n+\sqrt{nL/\epsilon})$ otherwise.
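For context, the finite sum setting referred to throughout these abstracts takes the standard form (a generic formulation, not quoted from the paper)
\[
  \min_{x \in \mathbb{R}^d} \; F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),
\]
where each individual function $f_i$ is $L$-smooth and $\mu$-strongly convex. The two rates contrasted above are then the unaccelerated rate $\tilde{\mathcal{O}}((n+L/\mu)\ln(1/\epsilon))$ and the accelerated rate $\tilde{\mathcal{O}}((n+\sqrt{nL/\mu})\ln(1/\epsilon))$.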
Dimension-Free Iteration Complexity of Finite Sum Optimization Problems
Many canonical machine learning problems boil down to a convex optimization
problem with a finite sum structure. However, whereas much progress has been
made in developing faster algorithms for this setting, the inherent limitations
of these problems are not satisfactorily addressed by existing lower bounds.
Indeed, current bounds focus on first-order optimization algorithms, and only
apply in the often unrealistic regime where the number of iterations is less
than $\mathcal{O}(d/n)$ (where $d$ is the dimension and $n$ is the number of
samples). In this work, we extend the framework of (Arjevani et al., 2015) to
provide new lower bounds, which are dimension-free, and go beyond the
assumptions of current bounds, thereby covering standard finite sum
optimization methods, e.g., SAG, SAGA, SVRG, SDCA without duality, as well as
stochastic coordinate-descent methods, such as SDCA and accelerated proximal
SDCA.
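As a concrete instance of the incremental methods covered by these bounds, here is a minimal sketch of an SVRG epoch in its textbook form; the function names, step size, and sampling scheme are illustrative and not taken from the paper.

    import numpy as np

    def svrg_epoch(grad_i, w_snapshot, w, n, step_size, inner_iters, rng):
        # Full gradient of the finite sum at the snapshot point.
        full_grad = sum(grad_i(i, w_snapshot) for i in range(n)) / n
        for _ in range(inner_iters):
            i = rng.integers(n)  # the method knows which index it queried
            # Variance-reduced stochastic gradient estimate.
            g = grad_i(i, w) - grad_i(i, w_snapshot) + full_grad
            w = w - step_size * g
        return w

    # Illustrative usage on a least-squares finite sum, f_i(w) = 0.5 * (A[i] @ w - b[i]) ** 2.
    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
    grad_i = lambda i, w: (A[i] @ w - b[i]) * A[i]
    w = np.zeros(5)
    for _ in range(20):
        w = svrg_epoch(grad_i, w, w, n=100, step_size=0.01, inner_iters=200, rng=rng)

Note that each inner step evaluates $\nabla f_i$ at a known index $i$; this is exactly the information that the stochastic-oracle lower bounds discussed in these papers withhold.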
Communication Complexity of Distributed Convex Learning and Optimization
We study the fundamental limits to communication-efficient distributed
methods for convex learning and optimization, under different assumptions on
the information available to individual machines, and the types of functions
considered. We identify cases where existing algorithms are already worst-case
optimal, as well as cases where room for further improvement is still possible.
Among other things, our results indicate that without similarity between the
local objective functions (due to statistical data similarity or otherwise),
many communication rounds may be required, even if the machines have unbounded
computational power.
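As a rough illustration of the setup (standard in this literature; the notation is ours, not the paper's), $m$ machines hold local objectives $F_1,\dots,F_m$ and cooperate over communication rounds to solve
\[
  \min_{x} \; F(x) = \frac{1}{m}\sum_{j=1}^{m} F_j(x),
\]
where each round allows only a limited exchange of information between machines. The 'similarity' referred to above means, e.g., that the local objectives have close gradients or Hessians, as happens when each $F_j$ is an empirical average over data drawn from the same distribution.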
On Lower and Upper Bounds in Smooth Strongly Convex Optimization - A Unified Approach via Linear Iterative Methods
In this thesis we develop a novel framework to study smooth and strongly
convex optimization algorithms, both deterministic and stochastic. Focusing on
quadratic functions we are able to examine optimization algorithms as a
recursive application of linear operators. This, in turn, reveals a powerful
connection between a class of optimization algorithms and the analytic theory
of polynomials whereby new lower and upper bounds are derived. In particular,
we present a new and natural derivation of Nesterov's well-known Accelerated
Gradient Descent method by employing simple 'economic' polynomials. This rather
natural interpretation of AGD contrasts with earlier ones which lacked a
simple, yet solid, motivation. Lastly, whereas existing lower bounds are only
valid when the dimensionality scales with the number of iterations, our lower
bound holds in the natural regime where the dimensionality is fixed.
Comment: A related paper co-authored with Shai Shalev-Shwartz and Ohad Shamir is to be published soon.
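A standard way to see the polynomial connection described above (a sketch in our notation, not the thesis's exact derivation): for a quadratic $f(x) = \tfrac{1}{2}x^\top A x - b^\top x$ with minimizer $x^\star$, any method whose iterates are fixed linear combinations of past gradients satisfies
\[
  x_k - x^\star = p_k(A)\,(x_0 - x^\star),
\]
for some degree-$k$ polynomial $p_k$ with $p_k(0) = 1$; e.g., gradient descent with step size $\eta$ gives $p_k(\lambda) = (1 - \eta\lambda)^k$. Upper and lower bounds then reduce to asking how small such a polynomial can be made on the spectrum $[\mu, L]$.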
Oracle Complexity of Second-Order Methods for Smooth Convex Optimization
Second-order methods, which utilize gradients as well as Hessians to optimize
a given function, are of major importance in mathematical optimization. In this
work, we prove tight bounds on the oracle complexity of such methods for smooth
convex functions, or equivalently, the worst-case number of iterations required
to optimize such functions to a given accuracy. In particular, these bounds
indicate when such methods can or cannot improve on gradient-based methods,
whose oracle complexity is much better understood. We also provide
generalizations of our results to higher-order methods.
Comment: 35 pages; added discussion of matching upper bounds, and generalization to higher-order methods.
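For reference (standard definitions, not specific to this paper), a second-order oracle queried at $x$ returns $(f(x), \nabla f(x), \nabla^2 f(x))$, and the prototypical update built from it is the damped Newton step
\[
  x_{k+1} = x_k - \eta_k\,[\nabla^2 f(x_k)]^{-1}\nabla f(x_k);
\]
oracle complexity then counts how many such queries are needed to reach an $\epsilon$-accurate minimizer of a smooth convex $f$.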
A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates
We provide tight finite-time convergence bounds for gradient descent and
stochastic gradient descent on quadratic functions, when the gradients are
delayed and reflect iterates from $\tau$ rounds ago. First, we show that
without stochastic noise, delays strongly affect the attainable optimization
error: in fact, the error can be as bad as that of non-delayed gradient descent
run on only $1/\tau$ of the gradients. In sharp contrast, we quantify how stochastic
noise makes the effect of delays negligible, improving on previous work which
only showed this phenomenon asymptotically or for much smaller delays. Also, in
the context of distributed optimization, the results indicate that the
performance of gradient descent with delays is competitive with synchronous
approaches such as mini-batching. Our results are based on a novel technique
for analyzing convergence of optimization algorithms using generating
functions.
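A minimal sketch of the delayed-update scheme analyzed here (our own illustrative code; the noise model and parameter names are assumptions, not the paper's construction):

    import numpy as np
    from collections import deque

    def delayed_sgd(grad, w0, step_size, delay, num_rounds, noise_std=0.0, seed=0):
        # At round t the update applies the gradient evaluated at the iterate
        # produced `delay` rounds earlier, optionally perturbed by Gaussian noise.
        rng = np.random.default_rng(seed)
        w = np.asarray(w0, dtype=float)
        stale = deque([w.copy()] * (delay + 1), maxlen=delay + 1)
        for _ in range(num_rounds):
            g = grad(stale[0])                     # gradient at the stale iterate
            if noise_std > 0.0:
                g = g + noise_std * rng.normal(size=np.shape(g))
            w = w - step_size * g
            stale.append(w.copy())                 # the oldest iterate drops out
        return w

    # Illustrative usage on a quadratic, f(w) = 0.5 * w @ H @ w - h @ w.
    H = np.diag([1.0, 10.0])
    h = np.array([1.0, 1.0])
    w_final = delayed_sgd(lambda w: H @ w - h, w0=np.zeros(2), step_size=0.01,
                          delay=5, num_rounds=2000, noise_std=0.1)

With noise_std = 0 this is plain gradient descent on stale gradients; the abstract's contrast is that adding stochastic noise makes the effect of the delay on the final error negligible.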
Symmetry & critical points for a model shallow neural network
We consider the optimization problem associated with fitting two-layer ReLU
networks with $k$ hidden neurons, where labels are assumed to be generated by a
(teacher) neural network. We leverage the rich symmetry exhibited by such
models to identify various families of critical points and express them as
power series in $k^{-1/2}$. These expressions are then used to derive
estimates for several related quantities which imply that not all spurious
minima are alike. In particular, we show that while the loss function at
certain types of spurious minima decays to zero like $\mathcal{O}(k^{-1})$, in other cases
the loss converges to a strictly positive constant. The methods used depend on
symmetry, the geometry of group actions, bifurcation, and Artin's implicit
function theorem.
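The teacher-student objective underlying this and the following two abstracts has the generic form (up to normalization conventions; a sketch, not quoted from the papers)
\[
  \mathcal{L}(W) = \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\Big[\Big(\sum_{i=1}^{k}\sigma(w_i^\top x) - \sum_{j=1}^{k}\sigma(v_j^\top x)\Big)^{2}\Big],
\]
where $\sigma(t) = \max\{t, 0\}$ is the ReLU, $W = (w_1,\dots,w_k)$ are the trainable (student) weights and $(v_1,\dots,v_k)$ the fixed target (teacher) weights. The loss is invariant under permutations of the student neurons and, for isotropic Gaussian inputs, under simultaneous orthogonal transformations of all weight vectors; this is the symmetry the analysis exploits.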
On the Principle of Least Symmetry Breaking in Shallow ReLU Models
We consider the optimization problem associated with fitting two-layer ReLU
networks with respect to the squared loss, where labels are assumed to be
generated by a target network. Focusing first on standard Gaussian inputs, we
show that the structure of spurious local minima detected by stochastic
gradient descent (SGD) is, in a well-defined sense, the \emph{least loss of
symmetry} with respect to the target weights. A closer look at the analysis
indicates that this principle of least symmetry breaking may apply to a broader
range of settings. Motivated by this, we conduct a series of experiments which
corroborate this hypothesis for different classes of non-isotropic non-product
distributions, smooth activation functions, and networks with a few layers.
Analytic Characterization of the Hessian in Shallow ReLU Models: A Tale of Symmetry
We consider the optimization problem associated with fitting two-layer ReLU
networks with $k$ neurons. We leverage the rich symmetry structure to
analytically characterize the Hessian and its spectral density at various
families of spurious local minima. In particular, we prove that for standard
$d$-dimensional Gaussian inputs with $d \ge k$: (a) of the $dk$ eigenvalues
corresponding to the weights of the first layer, $dk - \mathcal{O}(d)$ concentrate near
zero, (b) $\mathcal{O}(d)$ of the remaining eigenvalues grow linearly with $k$.
Although this phenomenon of extremely skewed spectrum has been observed many
times before, to the best of our knowledge, this is the first time it has been
established rigorously. Our analytic approach uses techniques, new to the
field, from symmetry breaking and representation theory, and carries important
implications for our ability to argue about statistical generalization through
local curvature.
On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions
Recent advances in randomized incremental methods for minimizing $L$-smooth
$\mu$-strongly convex finite sums have culminated in tight complexity of
$\tilde{\mathcal{O}}((n+\sqrt{nL/\mu})\ln(1/\epsilon))$ and $\tilde{\mathcal{O}}(n+\sqrt{nL/\epsilon})$,
where $\mu>0$ and $\mu=0$, respectively, and $n$ denotes the number of
individual functions. Unlike incremental methods, stochastic methods for finite
sums do not rely on an explicit knowledge of which individual function is being
addressed at each iteration, and as such, must perform at least $\tilde{\Omega}(n\ln(1/\epsilon))$
iterations to obtain $\epsilon$-optimal solutions. In this work, we exploit the
finite noise structure of finite sums to derive a matching $\tilde{\mathcal{O}}(n\ln(1/\epsilon))$ upper bound
under the global oracle model, showing that this lower bound is indeed tight.
Following a similar approach, we propose a novel adaptation of SVRG which is
both \emph{compatible with stochastic oracles} and achieves complexity bounds
for both the strongly convex ($\mu>0$) and the convex ($\mu=0$) settings. Our
bounds hold w.h.p. and match in part existing lower bounds for these settings.
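Stated informally (our paraphrase of the oracle models involved, not the paper's formal definitions), the distinction driving this line of work is
\[
  \text{incremental: } x \mapsto (i, \nabla f_i(x)), \qquad \text{stochastic: } x \mapsto \nabla f_I(x),\; I \sim \mathrm{Unif}\{1,\dots,n\} \text{ hidden},
\]
i.e., an incremental oracle reveals (or lets the algorithm choose) the index of the individual function it touches, whereas a stochastic oracle never discloses it; the global oracle model mentioned above is stronger in that a single query reveals an entire individual function rather than local first-order information.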